DECISION TREE BASED SPEECH RECOGNITION

FIELD OF THE INVENTION 

This invention relates to speech recognition. The invention is particularly useful for, but not necessarily limited to, large vocabulary speech recognition based upon binary decision trees for reducing speech recognition search space.

BACKGROUND OF THE INVENTION 

A large vocabulary speech recognition system recognises many received uttered words. In contrast, a limited vocabulary speech recognition system is limited to a relatively small number of words that can be uttered and recognized. Applications for limited vocabulary speech recognition systems include recognition of a small number of commands or names.

Large vocabulary speech recognition systems are being deployed in ever increasing numbers and are being used in a variety of applications. Such speech recognition systems need to be able to recognise received uttered words in a responsive manner without a significant delay before providing an appropriate response.

Large vocabulary speech recognition systems use correlation techniques to determine likelihood scores between uttered words (an input speech signal) and characterizations of words in acoustic space. These characterizations can be created from acoustic models that do not require training data from one or more speakers, and systems using them are therefore referred to as large vocabulary speaker independent speech recognition systems.

For a speaker independent large vocabulary speech recognition system, a large number of speech models is required in order to sufficiently characterise, in acoustic space, the variations in the acoustic properties found in an uttered input speech signal. For example, the acoustic properties of the phone /a/ will be different in the words "had" and "ban", even if spoken by the same speaker. Hence, phone units, known as context dependent phones, are needed to model the different sounds of the same phone found in different words.

A speaker independent large vocabulary speech recognition system typically spends an undesirably large portion of time finding matching scores, known in the art as likelihood scores, between an input speech signal and each of the acoustic models used by the system. Each of the acoustic models is typically described by a multiple Gaussian probability density function (pdf), with each Gaussian described by a mean vector and a covariance matrix. In order to find a likelihood score between the input speech signal and a given model, the input has to be matched against each Gaussian. The final likelihood score is then given as the weighted sum of the scores from each Gaussian member of the model. The number of Gaussians in each model is typically of the order of 8 to 64.

It is well known that not all Gaussians within a speech model generate a high score for a given input speech signal. For a Gaussian with mean values considerably different from the input signal values, the score is very close to 0 as the input is at the "tail" of the Gaussian distribution. This implies that the contribution of such a Gaussian to the overall likelihood score will be negligible. Hence, the calculation of the likelihood score for a model using all the Gaussians can be approximated accurately by using only a subset of the Gaussians within the model.
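The approximation described above can be illustrated with a minimal Python sketch. It assumes diagonal covariance matrices and a hypothetical two-dimensional mixture of three Gaussians; the function names, the mixture weights and the input values are illustrative assumptions only and are not taken from any particular recognizer.

```python
import math

def gaussian_density(x, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at the point x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def mixture_likelihood(x, components):
    """Weighted sum of component densities; components = [(weight, mean, var), ...]."""
    return sum(w * gaussian_density(x, m, v) for w, m, v in components)

# Hypothetical mixture: two Gaussians near the input and one far away.
components = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.3, [0.5, -0.5], [1.0, 1.0]),
    (0.2, [12.0, 12.0], [1.0, 1.0]),  # the input lies in this Gaussian's tail
]
x = [0.1, 0.2]

full = mixture_likelihood(x, components)           # all Gaussians
shortlist = mixture_likelihood(x, components[:2])  # subset only
relative_error = (full - shortlist) / full
print(full, shortlist, relative_error)
```

In this example the distant Gaussian contributes so little that dropping it leaves the likelihood score essentially unchanged, which is the property that a Gaussian shortlist relies upon.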

The subset of Gaussians within the model is typically selected using a method known as Gaussian selection in which a subset of the Gaussians in the model set is selected for a particular input speech signal. The subset, also called a Gaussian shortlist, is then used to calculate the likelihood scores for each model. However, the Gaussian shortlist is based upon vector clustering and in order to obtain acceptable real time responses, for large vocabulary speech recognition systems, the number of clusters must be unnecessarily large.

In this specification, including the claims, the terms 'comprises', 'comprising' or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.



SUMMARY OF THE INVENTION 

According to one aspect of the invention there is provided a method for creating at least one decision tree for processing a sampled signal indicative of speech, the method comprising the steps of:

providing model sub vectors from partitioned statistical speech models of phones, the models comprising vectors of mean values and associated variance values;

statistically analyzing at least some of the model sub vectors of mean values to provide projection vectors indicating directions of relative maximum variance between the sub vectors;

calculating projection values for a plurality of the projection vectors;

selecting potential threshold values from analysis of a range of projection values; and

creating the decision tree having decisions to divide the model sub vectors into groups, the groups being leaves of the tree, wherein the decisions are based upon selected threshold values selected from the potential threshold values, the selected threshold values being selected by change in variance between said model sub vectors, the variance being determined from said mean values and associated variance values.

Preferably, the groups have statistical characteristics defining an acoustical subspace.



Suitably, the speech models are based on Gaussian 
probability distributions. 






Preferably, the step of statistically analyzing is further characterized by the projection vectors being calculated by principal component analysis.

Preferably, the potential threshold values are selected from a subset of the projection values.

Suitably, the decisions are based upon an inequality calculation.

Preferably, the inequality calculation relates to inequality between a transpose of a selected model sub vector multiplied by a projection vector and one of said potential threshold values.

The subset is suitably selected from projection vectors having projection values with greatest variance.

Preferably, the potential threshold values are determined from a range between minimum and maximum projection values of each of the projection vectors in the subset.

Suitably, the potential threshold values are determined by dividing the range into evenly spaced sub ranges.

Suitably, the decision tree is a binary decision tree.

According to another aspect of this invention there is provided a method for speech recognition comprising the steps of:

providing a sampled speech signal processed into at least one feature vector representing spectral characteristics of a speech signal;

dividing the feature vector into sub feature vectors;

applying each of the sub feature vectors to a corresponding decision tree, to obtain groups of model sub vectors that are likely to indicate at least one phone of the sampled speech signal, the decision tree being created by analysis of the model sub vectors obtained from statistical speech models, wherein the decision tree has decisions based upon selected threshold values selected from potential threshold values, the selected threshold values being selected by change in variance between said model sub vectors, the variance being determined from said mean values and variance values associated with said model sub vectors;

selecting a plurality of the model sub vectors from the groups of sub feature vectors to thereby identify a shortlist of model sub vectors; and

processing the shortlist to provide a transcription of the sampled speech signal.

Preferably, the transcription is a text version of the sampled speech signal. The transcription may suitably be a control signal. The control signal may for example activate a function on an electronic device or system.



Preferably, the decision tree may be created by 
the above method for creating at least one decision 
tree . 






BRIEF DESCRIPTION OF THE DRAWINGS 



In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred embodiment as illustrated with reference to the accompanying drawings in which:

Fig. 1 is a schematic block diagram of a 
speech recognition system in accordance with the 
invention; 

Fig. 2 is a flow diagram illustrating a 
method for creating a decision tree for processing 
a sampled signal indicative of speech; and 

Fig. 3 is a flow diagram illustrating a 
method for speech recognition that uses the 
decision tree created by the method of Fig. 2. 



DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE 
INVENTION 



Referring to Fig. 1 there is illustrated a schematic block diagram of a speech recognition system 1 comprising a statistical speech models database 110 with outputs coupled to inputs of a partitioning module 120 and a speech recognizer 160. The partitioning module 120 has an output coupled to an input of a threshold value generator 130 that has an output coupled to an input of a decision tree creator 140. An output of the decision tree creator 140 is coupled to an input of a decision tree store 170. The decision tree store 170 has an output coupled to an input of the speech recognizer 160. There is also a speech model converter 150 having an input for receiving a speech signal. The speech model converter 150 has an output coupled to an input of the speech recognizer 160.

In Fig. 2 there is illustrated a method 200 for creating a decision tree for processing a sampled signal indicative of speech. After a start step 210, the method 200 includes a providing model sub vectors step 220, in which model sub vectors are provided from partitioned statistical speech models of phones. The statistical speech models comprise vectors of mean values and associated variance values. In this embodiment the statistical speech models are stored in the statistical speech models database 110 and are based on tri-phones modeled by what is known in the art as a Hidden Markov Model (HMM) with multiple states. Each of the states of the HMM is modeled by a multi-mixture Gaussian Probability Density Function. Accordingly, the speech models are based on Gaussian probability distributions or Gaussian mixtures, where the Gaussian mixtures $\{g_{jm}\}$ are of the form:

$$\{g_{jm}\} = \{w_{jm},\ \mu_{jm},\ \Sigma_{jm}\} \qquad (1)$$

where $w_{jm}$ is a scalar weight, $\mu_{jm}$ is a mean value vector and $\Sigma_{jm}$ is a covariance matrix, each being for an m-th Gaussian mixture in a j-th HMM state. The covariance matrix $\Sigma_{jm}$ is typically a diagonal matrix with only the leading diagonal having non-zero values and can be simplified into a variance vector $\sigma_{jm}$.

If, for instance, the variance vector $\sigma_{jm}$ and mean value vector $\mu_{jm}$ are both 39-dimension vectors, then the partitioning module 120 at step 220 partitions each of the vectors $\mu_{jm}$ and $\sigma_{jm}$ into three respective model sub vectors $\mu_{jm1}$, $\mu_{jm2}$, $\mu_{jm3}$ and $\sigma_{jm1}$, $\sigma_{jm2}$, $\sigma_{jm3}$. Each of the model sub vectors $\mu_{jm1}$, $\mu_{jm2}$, $\mu_{jm3}$, $\sigma_{jm1}$, $\sigma_{jm2}$ and $\sigma_{jm3}$ is a 13-dimension vector containing elements from the original respective mean value vector $\mu_{jm}$ or variance vector $\sigma_{jm}$. The sub vector $\mu_{jm1}$ consists of the first 13 elements from the mean value vector $\mu_{jm}$. The corresponding sub vectors $\mu_{jm2}$ and $\mu_{jm3}$ consist respectively of the next 13 elements and the last 13 elements from $\mu_{jm}$. The same partition method used to partition the mean value vector $\mu_{jm}$ is applied to the variance vector $\sigma_{jm}$. That is, the sub vectors $\sigma_{jm1}$, $\sigma_{jm2}$, $\sigma_{jm3}$ consist respectively of the first 13 elements, the next 13 elements and the last 13 elements of the variance vector $\sigma_{jm}$. The providing model sub vectors step 220 is applied to all the statistical speech models of phones present in the statistical speech models database 110. For example, the speech models database may contain 40,000 Gaussian mixtures, which in turn will generate 40,000 x 3 partitions of Gaussian mixtures $\{g_{jm}\}$ = 120,000 model mean value sub vectors from the mean value vectors $\mu_{jm}$ and another 120,000 model variance sub vectors from the variance vectors $\sigma_{jm}$. It should be noted at this point that each of the three partitions of the Gaussian mixtures $\{g_{jm}\}$ corresponds to a decision tree created as described below.
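By way of illustration only, the partitioning of step 220 might be sketched in Python as follows. The function name `partition` and the sample vector contents are assumptions made for this example; only the split of a 39-dimension vector into three consecutive 13-dimension sub vectors comes from the description above.

```python
def partition(vector, parts=3):
    """Split a vector into `parts` consecutive sub vectors of equal length."""
    if len(vector) % parts != 0:
        raise ValueError("vector length must be divisible by the number of parts")
    size = len(vector) // parts
    return [vector[k * size:(k + 1) * size] for k in range(parts)]

# Hypothetical 39-dimension mean value and variance vectors of one Gaussian.
mean = [float(i) for i in range(39)]
variance = [1.0 + 0.1 * i for i in range(39)]

mean_subs = partition(mean)      # first 13, next 13 and last 13 elements
var_subs = partition(variance)   # the same partition applied to the variances
print([len(s) for s in mean_subs])
```

The same function would be applied to every mean value vector and variance vector in the database, so that each partition index K = 1, 2, 3 collects one sub vector per Gaussian.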

The model sub vectors generated in step 220 from all the speech models in database 110 are then statistically analyzed in step 230 to provide projection vectors that indicate the directions of relative maximum variance between the model mean value sub vectors. A statistical analysis method known in the art as Principal Component Analysis, as described in Chapter 12 (12-1, 12-2) of the S-PLUS Guide to Statistical and Mathematical Analysis published by StatSci, Seattle, Washington, is used to calculate the projection vectors. This reference is included herewith as part of this specification. In particular, Principal Component Analysis is applied for each partition of 40,000 model mean value sub vectors $\mu_{jm1}$, $\mu_{jm2}$, $\mu_{jm3}$ according to the equation:

$$C = U \Lambda U^{T} \qquad (2)$$

where C is the covariance matrix of dimension 13 x 13 computed from the 40,000 mean value sub vectors; U is a matrix of dimension 13 x 13 with each row of U corresponding to a projection vector; and $\Lambda$ is a 13 x 13 diagonal matrix where the value of the i-th diagonal element (i = 1 to 13) measures the relative variance between the sub vectors in the direction associated with the projection vector in the i-th row of matrix U. The diagonal values of $\Lambda$ are known in the art as principal components and are ranked in descending order. Typically, most of the variance between the sub vectors can be accounted for by the first 4 principal components and their corresponding projection vectors. Hence only 4 of the 13 projection vectors are chosen and thereby provided as an output of the partitioning module 120 in step 230. Accordingly, for the three mean value sub vector partitions $\mu_{jm1}$, $\mu_{jm2}$, $\mu_{jm3}$ there are a total of 12 projection vectors.
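A minimal sketch of how the leading projection vectors could be obtained is given below. It does not use the S-PLUS routine referred to above; instead it assumes a simple power iteration with deflation on the covariance matrix, written in plain Python, and it uses a small hypothetical 3-dimension data set in place of the 40,000 13-dimension sub vectors. All function names are illustrative.

```python
import math

def covariance_matrix(vectors):
    """Covariance matrix C of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for v in vectors:
        c = [v[i] - mean[i] for i in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += c[a] * c[b] / n
    return cov

def top_components(cov, count, iterations=500):
    """Return [(variance, projection_vector), ...] for the `count` largest
    eigenvalues of cov, found by power iteration with deflation."""
    d = len(cov)
    work = [row[:] for row in cov]
    result = []
    for k in range(count):
        vec = [1.0 / (i + 1 + k) for i in range(d)]  # arbitrary non-zero start
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            nxt = [sum(work[a][b] * vec[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break  # no variance left in the remaining directions
            vec = [x / norm for x in nxt]
        value = sum(vec[a] * sum(work[a][b] * vec[b] for b in range(d))
                    for a in range(d))
        result.append((value, vec))
        for a in range(d):  # deflation: remove the component just found
            for b in range(d):
                work[a][b] -= value * vec[a] * vec[b]
    return result

# Hypothetical data with most variance along the first axis.
data = [[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, -1.0, 0.0],
        [0.0, 1.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, -0.5]]
components = top_components(covariance_matrix(data), count=2)
for variance, vector in components:
    print(round(variance, 4), [round(x, 4) for x in vector])
```

The components are returned in descending order of variance, mirroring the ranking of the principal components described above; a production system would more likely rely on a tested eigen-decomposition routine.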

A calculating projection values step 240 is then effected in which projection values are calculated for each of the 12 mean value projection vectors (four per partition) in the threshold value generator 130. A projection vector is selected and a projection value is calculated for each of the corresponding 40,000 mean value sub vectors per partition according to the equation:

$$\mu_{jmK}^{T}\, u_i \qquad (3)$$

where K = 1, 2, 3 is an index indicating each of the 3 partitions and i = 1, 2, 3, 4 is an index indicating each of the 4 mean value projection vectors $u_i$.
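The projection value of equation (3) is an inner product between a mean value sub vector and a projection vector. A short Python sketch follows, using hypothetical 3-dimension vectors for brevity in place of the 13-dimension sub vectors; the names are illustrative only.

```python
def projection_value(sub_vector, projection_vector):
    """Inner product of a mean value sub vector with a projection vector."""
    return sum(m * u for m, u in zip(sub_vector, projection_vector))

# Hypothetical mean value sub vectors of one partition and one projection vector.
mean_sub_vectors = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.0, 2.0, 2.0]]
u = [0.6, 0.8, 0.0]

values = [projection_value(m, u) for m in mean_sub_vectors]
print(values)
```

Repeating this for every sub vector of a partition gives the set of projection values from which the minimum and maximum are later taken.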

After the step 240, a test step 250 is effected in which the threshold value generator 130 checks whether or not projection values have been calculated for each of the projection vectors of a partition. If not, an unprocessed projection vector is selected and applied to step 240 for calculating its projection values. Otherwise, the method moves to a selecting potential threshold values step 260, where the projection values are analyzed, by the threshold value generator 130, in order to select potential threshold values from a range of projection sub values.

In the selecting potential threshold values step 260, potential threshold values are selected for each of the mean value projection vectors from analysis of the 40,000 projection values per partition. For instance, the range between the minimum and maximum projection values can be divided into evenly spaced sub ranges, with the potential threshold values determined according to the equation:

$$k_i(b) = p_{Ki}^{min} + (b + 0.5)\,\frac{p_{Ki}^{max} - p_{Ki}^{min}}{B} \qquad (4)$$

where $p_{Ki}^{max}$ and $p_{Ki}^{min}$ are the maximum and minimum projection values respectively; K = 1, 2, 3 is an index indicating each of the 3 partitions; i = 1, 2, 3, 4 is an index indicating each of the 4 projection vectors $u_i$; b = 1, 2, ..., B is an index for a particular sub range; and B, typically chosen to be 10, is the total number of sub ranges between the minimum and maximum projection values. Hence, each of the 12 projection vectors has 10 associated potential threshold values selected from a subset of the projection values with greatest variance.
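The selection of evenly spaced potential threshold values may be sketched in Python as follows. The sketch follows equation (4) as printed, with b running from 1 to B, under which the last candidate falls slightly above the maximum projection value; the function name and the sample projection values are assumptions for illustration.

```python
def potential_thresholds(projection_values, num_sub_ranges=10):
    """Evenly spaced candidate thresholds between the minimum and maximum
    projection values, following equation (4) with b = 1 .. B."""
    p_min = min(projection_values)
    p_max = max(projection_values)
    step = (p_max - p_min) / num_sub_ranges
    return [p_min + (b + 0.5) * step for b in range(1, num_sub_ranges + 1)]

# Hypothetical projection values for one projection vector.
projection_values = [0.0, 2.5, 4.0, 7.5, 10.0]
thresholds = potential_thresholds(projection_values)
print(thresholds)
```

With B = 10 this yields 10 candidate thresholds per projection vector, so four projection vectors give the 40 candidate questions per partition referred to in the description.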

Next, a creating decision tree step 270 is effected in the decision tree creator 140 to create binary decision trees. The decisions of each tree divide the model sub vectors into groups, the groups being leaves of the trees, and the decisions are based on selected threshold values selected from the potential threshold values in step 260. In particular, decisions are based on the following inequality calculation:

$$x^{T} u_i > k_i(b) \qquad (5)$$

where x is a selected model sub vector of mean values, $u_i$ is a projection vector and $k_i(b)$ is a potential threshold value associated with the projection vector, computed in step 260 according to equation (4).

A binary decision tree is created for each of the three partitions using the corresponding 40,000 model mean value sub vectors. Each non-leaf node of the created decision tree has an associated question of the form given in equation (5). For each non-leaf node, the 4 projection vectors of the partition multiplied by 10 threshold values give 40 potential questions. One of the questions is then selected to maximise the change in variance between the sub vectors within the parent node and the sub vectors within the left and right child nodes.

The variance $v^{n}$ of the data in the n-th tree node is defined as:

$$v^{n} = \sum_{i=1}^{D} \log\left[v^{n}(i)\right] \qquad (6)$$

where D = 13 is the dimension of the sub vectors. $v^{n}(i)$ is the data variance for the i-th dimension in the sub vector and is given by the following equation:

$$v^{n}(i) = \frac{1}{L}\sum_{j=1}^{L}\left(\sigma_j^{2}(i) + \mu_j^{2}(i)\right) - \left(\frac{1}{L}\sum_{j=1}^{L}\mu_j(i)\right)^{2} \qquad (7)$$

where j is the index of sub vectors; L is the number of sub vectors assigned to the node; and $\mu_j(i)$ and $\sigma_j(i)$ are the i-th dimensional elements of the j-th sub vector mean and standard deviation for the n-th node respectively.

The change in variance d is then determined by:

$$d = v^{\mathrm{parent}} - \left(v^{\mathrm{left}} + v^{\mathrm{right}}\right) \qquad (8)$$

where $v^{\mathrm{parent}}$, $v^{\mathrm{left}}$ and $v^{\mathrm{right}}$ represent the variance of the sub vectors in the parent, left child and right child node respectively.
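The variance calculations of equations (6) to (8) can be sketched in Python as shown below. The sketch reads equation (6) as a sum of logarithms over the dimensions and equation (7) as the pooled variance of the Gaussians assigned to a node (mean of second moments minus squared mean of means); this reading, the function names and the one-dimensional sample data are assumptions for illustration.

```python
import math

def node_variance(means, variances):
    """Equations (6) and (7): sum over dimensions of the log pooled variance.
    means[j][i] and variances[j][i] hold the mean and variance of the j-th
    sub vector in dimension i."""
    count = len(means)
    dims = len(means[0])
    total = 0.0
    for i in range(dims):
        second_moment = sum(variances[j][i] + means[j][i] ** 2
                            for j in range(count)) / count
        first_moment = sum(means[j][i] for j in range(count)) / count
        total += math.log(second_moment - first_moment ** 2)
    return total

def variance_change(parent, left, right):
    """Equation (8): parent variance minus the sum of the child variances.
    Each argument is a (means, variances) pair."""
    return node_variance(*parent) - (node_variance(*left) + node_variance(*right))

# Hypothetical one-dimensional node with two well separated clusters.
means = [[-3.0], [-3.0], [3.0], [3.0]]
variances = [[1.0], [1.0], [1.0], [1.0]]
parent = (means, variances)
left = (means[:2], variances[:2])
right = (means[2:], variances[2:])

d = variance_change(parent, left, right)
print(d)
```

A split that separates the two clusters gives a large positive d, which is why the question maximising d is chosen for a node.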

The decision tree has a number of leaf nodes where each leaf corresponds to a group of model sub vectors sharing similar statistical characteristics that together define an acoustical subspace.

The sub vectors in a leaf node satisfy the following conditions:

(1) The number of model sub vectors is less than a threshold, chosen to be 10; and
(2) The maximum possible change in variance according to equations (6) - (8) is less than a threshold, chosen to be 0.1.

There are three decision trees created in the decision tree creator 140 at step 270, each corresponding to one of the three partitions. Each of the non-leaf nodes has a decision associated therewith based on the inequality of equation (5). The decision of each non-leaf node is selected to maximise change in variance between sub vectors and is of the form:

$$x^{T} u_i > k_i \qquad (9)$$

where x is a sub feature vector, described below; $u_i$ is a selected projection vector for the node; and $k_i$ is a selected threshold value associated with the projection vector.

The decision trees are stored in the decision 
tree store 170 and the method 200 terminates at an end 
step 280. 

Referring to Fig. 3, there is illustrated a method 300 for speech recognition that uses the decision tree created by the method 200. After a start step 310, speech recognition commences in which the method 300 first provides, at a providing step 320, a sampled speech signal from an incoming speech utterance that is received and processed by the speech model converter 150. The sampled speech signal is processed by the speech model converter 150 into one or more feature vectors that represent the spectral characteristics of the underlying speech signal. Each feature vector is of the same dimension (39) as the mean value vector $\mu_{jm}$ and variance vector $\sigma_{jm}$ of the statistical speech models stored in the statistical speech models database 110. For instance, a method known in the art as mel-frequency cepstral coefficients (MFCCs) is used. A typical known method of finding the MFCCs is included herewith by reference to the paper "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences" by Davis and Mermelstein, published in IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, pp. 357 - 366.

Next, a dividing feature vector step 330 is effected in the speech recognizer 160 in which the feature vectors are divided into sub feature vectors. The identical partition method used in step 220 for the statistical speech models is used in step 330. In particular, each 39-dimension feature vector x is divided into three 13-dimension sub feature vectors $x_1$, $x_2$, $x_3$ that consist respectively of the first 13 elements, the next 13 elements and the last 13 elements thereof.

Each of the sub feature vectors is then applied, at an applying step 340, to the corresponding one of three decision trees in the decision tree store 170 which is accessed by the speech recognizer 160. The applying step applies each of the sub feature vectors to a corresponding decision tree, to obtain groups of model sub vectors that are likely to indicate at least one phone of the sampled speech signal. As will be apparent to a person skilled in the art, each of the three decision trees was created by analysis of model sub vectors obtained from the statistical speech models database 110.

The sub feature vector is first applied to the root node of the decision tree by evaluating the decision of equation (9) associated with the root node. The sub feature vector is then assigned to either the left or right child node according to the outcome of the evaluation. The decision of equation (9) associated with the child node chosen is then evaluated with the sub feature vector. The process repeats until a leaf node has been reached and a group of model sub vectors for the sub feature vector is obtained. The group defines an acoustical subspace that indicates at least one phone of the sampled speech signal.
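The traversal just described might be sketched in Python as follows. The `Node` class, the convention that a satisfied inequality leads to the right child, and the small two-dimension example tree are all assumptions made for illustration; the description does not state which child corresponds to which outcome.

```python
class Node:
    """Decision tree node: a question (projection, threshold) or a leaf (group)."""
    def __init__(self, projection=None, threshold=None,
                 left=None, right=None, group=None):
        self.projection = projection
        self.threshold = threshold
        self.left = left
        self.right = right
        self.group = group  # list of model sub vector identifiers at a leaf

def find_group(root, sub_feature_vector):
    """Evaluate the decision of equation (9) at each node until a leaf is reached."""
    node = root
    while node.group is None:
        value = sum(x * u for x, u in zip(sub_feature_vector, node.projection))
        # Assumed convention: inequality satisfied -> right child.
        node = node.right if value > node.threshold else node.left
    return node.group

# Hypothetical tree over 2-dimension sub feature vectors.
tree = Node(
    projection=[1.0, 0.0], threshold=0.0,
    left=Node(group=["g1", "g2"]),
    right=Node(
        projection=[0.0, 1.0], threshold=1.0,
        left=Node(group=["g3"]),
        right=Node(group=["g4", "g5"]),
    ),
)

print(find_group(tree, [-1.0, 5.0]))
print(find_group(tree, [2.0, 3.0]))
```

Each evaluation costs one inner product and one comparison, so the group for a sub feature vector is found in a number of steps equal to the depth of the leaf rather than by scoring every Gaussian.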

A test step 350 is then effected to check whether or not all the sub feature vectors have been applied to the corresponding decision tree. If not, an unprocessed sub feature vector is selected and applied to its decision tree. Otherwise, the method moves to a selecting step 360 in which model sub vectors are selected to identify and create shortlists of sub vectors.

Each of the feature vectors x is now associated with three groups of model sub vectors obtained from each of the three sub feature vectors $x_1$, $x_2$, $x_3$ and their corresponding decision trees. A shortlist of model vectors is then identified in the selecting step 360 from the model sub vectors in the three groups $s_1$, $s_2$ and $s_3$. In particular, a model vector is evaluated as to whether each of its model sub vectors belongs to the corresponding group associated with the feature vector x. If so, a score is assigned to the model vector. A model vector is selected into the shortlist for feature vector x if the total score is greater than a threshold according to the empirically determined equation:

$$s_1 + 0.5\,s_2 + 0.5\,s_3 > 0.9 \qquad (10)$$

where $s_1$, $s_2$ and $s_3$ are each set to 1 if the corresponding model sub vector is present in the respective group, and are otherwise set to zero. Hence, the strategy used to select the shortlist for a feature vector x is to include a model vector if its model sub vector is in group $s_1$; if its model sub vector is not in group $s_1$, it must be present in both group $s_2$ and group $s_3$ to be selected as a member of the shortlist.
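The scoring rule of equation (10) can be sketched in Python as shown below, reading the third term as $s_3$ as the accompanying explanation implies. The sketch assumes that each model vector is identified by a single integer identifier shared by its three sub vectors, and that each group is held as a set of such identifiers; these representations and the names used are illustrative only.

```python
def select_shortlist(model_ids, group1, group2, group3, threshold=0.9):
    """Keep a model vector if s1 + 0.5*s2 + 0.5*s3 exceeds the threshold,
    where each s is 1 if the model's sub vector is in the matching group."""
    shortlist = []
    for model_id in model_ids:
        s1 = 1 if model_id in group1 else 0
        s2 = 1 if model_id in group2 else 0
        s3 = 1 if model_id in group3 else 0
        if s1 + 0.5 * s2 + 0.5 * s3 > threshold:
            shortlist.append(model_id)
    return shortlist

# Hypothetical groups returned by the three decision trees for one feature vector.
group1 = {0, 1}
group2 = {1, 2, 3}
group3 = {3, 4}

shortlist = select_shortlist(range(6), group1, group2, group3)
print(shortlist)
```

Model 3 is kept although it is absent from the first group, because it appears in both the second and third groups; models 2 and 4 appear in only one of those groups and are dropped.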

The shortlists identified for the feature vectors are then processed in a processing step 370 to provide a transcription of the sampled speech signal. This is provided by what is known in the art as a decoding method. A typical implementation of a decoding method, which is included herewith in this specification, can be found in the publication "A One Pass Decoder Design for Large Vocabulary Recognition" by J. J. Odell, V. Valtchev, P. C. Woodland and S. J. Young in Proceedings ARPA Workshop on Human Language Technology, pp. 405 - 410, 1994.

The transcription is provided at an output of the speech recognizer 160. The transcription in one form is a text version of the sampled speech signal. Alternatively, the transcription may be a control signal to activate a function on an electronic device or system. The method terminates at an end step 380.

Advantageously, the present invention can alleviate the problems with unnecessary processing of distribution "tails" of statistical speech models during speech recognition. The invention also alleviates the overheads associated with unnecessarily large clusters affecting speech recognition response times.

The detailed description provides a preferred exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.



